Skip to content

perf: optimize table.add_files and inspect.files #2133

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 22, 2025

Conversation

jayceslesar
Copy link
Contributor

@jayceslesar jayceslesar commented Jun 21, 2025

Should help with #2130 and #2132

Modifies Table.add_files to explicitly use inspect.data_files and also parallelize inspect._files

I didn't see anywhere else where looping over manifest entries was parallelized, so seems better to parallelize across manifests than within.

No changes here but should be faster.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for adding the optimization

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
@kevinjqliu
Copy link
Contributor

@jayceslesar looks like the linter isnt happy, could you run make lint?

@jayceslesar
Copy link
Contributor Author

jayceslesar commented Jun 22, 2025

I wonder if we can eventually replace this portion with rust -- unsure what that implementation supports currently but would be fun to try it out https://github.com/apache/iceberg-rust/tree/main/crates/iceberg/src/inspect

@kevinjqliu
Copy link
Contributor

yep thats the goal eventually. I also think the biggest perf bottleneck is reading the avro metadata files. Once we move that part to rust (apache/iceberg-rust#1328), Im curious to see how much the entire process improves.

@kevinjqliu kevinjqliu changed the title perf: table.add_files and inspect.files perf: optimize table.add_files and inspect.files Jun 22, 2025
@kevinjqliu kevinjqliu merged commit e937f6a into apache:main Jun 22, 2025
10 checks passed
@kevinjqliu
Copy link
Contributor

Thanks @jayceslesar

btw i change the PR description so we dont close the 2 issues automatically

amitgilad3 pushed a commit to amitgilad3/iceberg-python that referenced this pull request Jul 7, 2025
Should help with apache#2130 and apache#2132

Modifies `Table.add_files` to explicitly use `inspect.data_files` and
also parallelize `inspect._files`

I didn't see anywhere else where looping over manifest entries was
parallelized, so seems better to parallelize across manifests than
within.

No changes here but should be faster.

---------

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants